Data visualization for Starbucks kids drinks nutritional facts

Putting Visual Analytics into Practical Use

Min Xiaoqi (https://www.linkedin.com/in/xiaoqi-min/), Master of IT in Business, Singapore Management University (https://scis.smu.edu.sg/master-it-business/financial-technology-and-analytics-track)
2022-02-18

1. The task

In this take-home exercise, I will apply appropriate data visualization techniques to segment the drinks in the “kids drinks and other” category by their nutrition indicators. The data set starbucks_drink.csv will be used.

2. Data challenges

3. Proposed visualization

As mentioned in the challenges section, there are two goals we want to achieve through data visualization. We will use a correlation plot to understand the relationships between the different nutrition indicators, and a heatmap to compare the amount of each nutrient across drink types and see whether a certain drink contains a significant amount of a specific nutrient such as caffeine or sugars.

4. Installing/loading required packages

For this task, we will install (if necessary) and load seriation, dendextend, heatmaply, tidyverse and corrplot. readr and dplyr are also listed explicitly, although both are already part of the tidyverse.

packages <- c('seriation', 'dendextend', 'heatmaply', 'tidyverse', 'corrplot', 'readr', 'dplyr')

# Install each package if it is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}

5. Data import and wrangling

5.1 Data import

We will first import the data set “starbucks_drink.csv”. Since it is in CSV format, we will use read_csv() from readr to load it.

sb <- read_csv("data/starbucks_drink.csv")

5.2 Data wrangling

For this task, only the “kids drinks and other” category from the data set will be used. Hence, the filter() function of dplyr is used to extract the relevant rows.

sb_kids <- sb %>%
  filter(Category == "kids-drinks-and-other")

By examining the filtered data set, we see that for “steamed apple juice” the “Milk” and “Whipped Cream” options are not applicable and are left blank, so read_csv() imports them as missing values (NA). We first convert these missing values to the string “NA”, as shown in the following code chunk.

sb_kids[is.na(sb_kids)] <- "NA"

Then, we will change the NA rows under “Milk” to “No Milk”, and NA rows under “Whipped Cream” to “No Whipped Cream”.

sb_kids$Milk[sb_kids$Milk == "NA"] <- "No Milk"
sb_kids$`Whipped Cream`[sb_kids$`Whipped Cream` == "NA"] <- "No Whipped Cream"
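The two steps above could also be collapsed into one. The following is a minimal sketch, assuming replace_na() from tidyr and a small made-up tibble (toy_drinks is hypothetical, not the real data set). It fills missing values only in the two columns concerned, so numeric columns are never touched.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data standing in for sb_kids
toy_drinks <- tibble(
  Name = c("Steamed Apple Juice", "Hot Chocolate"),
  Milk = c(NA, "2% Milk"),
  `Whipped Cream` = c(NA, "Whipped Cream"),
  Calories = c(110, 190)
)

# Replace NA only in the named columns; other columns are left as they are
toy_filled <- replace_na(
  toy_drinks,
  list(Milk = "No Milk", `Whipped Cream` = "No Whipped Cream")
)
```

Restricting the replacement to named columns avoids writing a character value into a numeric column that happens to contain a missing value.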

Next, we will concatenate the drink types with milk and whipped cream choices using the unite function.

sb_kids_unite <- sb_kids %>%
  unite(Drink, c("Name", "Milk","Whipped Cream"))
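To make the effect of unite() concrete, here is a small sketch on a made-up one-row tibble (toy is hypothetical). unite() joins the values with "_" unless told otherwise and drops the source columns. The sep argument used here is only an illustration of how a more readable label could be produced, not what the chunk above does.

```r
library(tibble)
library(tidyr)

# Hypothetical single drink record
toy <- tibble(
  Name = "Hot Chocolate",
  Milk = "Soy",
  `Whipped Cream` = "No Whipped Cream",
  Calories = 150
)

# Join the three columns into one label, using " | " instead of the default "_"
toy_united <- unite(toy, Drink, c("Name", "Milk", "Whipped Cream"), sep = " | ")
toy_united$Drink
```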

As mentioned in the data set challenges, “Caffeine(mg)” is not stored in the correct data type, hence we will convert it from character to numeric as shown below.

sb_kids_unite$`Caffeine(mg)` <- as.numeric(as.character(sb_kids_unite$`Caffeine(mg)`))
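When a character column is coerced with as.numeric(), any entry that cannot be parsed silently becomes NA (with a warning). The sketch below, in base R, shows one way to list such entries before relying on the converted column. The values in raw_caffeine are invented for illustration and are not taken from the data set.

```r
# Hypothetical raw values; the non-numeric entries are made up
raw_caffeine <- c("0", "25", "varies", "-")

# Coerce, silencing the "NAs introduced by coercion" warning
parsed <- suppressWarnings(as.numeric(raw_caffeine))

# Entries that were present before conversion but are NA afterwards
failed <- raw_caffeine[is.na(parsed) & !is.na(raw_caffeine)]
failed
```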

Finally, we will group the rows by drink and calculate the average amount of each nutrient per fluid ounce by dividing the total amount by the total portion size (fl oz). This ensures that the result is not biased by portion size, since a larger portion of the same drink naturally contains more calories and sugars.

sb_kids_grouped <- sb_kids_unite %>%
  group_by(`Drink`) %>%
  summarise('Calories' = sum(`Calories`)/sum(`Portion(fl oz)`),
           'Calories from fat'  = sum(`Calories from fat`)/sum(`Portion(fl oz)`),
           'Total Fat(g)' = sum(`Total Fat(g)`)/sum(`Portion(fl oz)`),
           'Saturated fat(g)' = sum(`Saturated fat(g)`)/sum(`Portion(fl oz)`),
           'Trans fat(g)' = sum(`Trans fat(g)`)/sum(`Portion(fl oz)`),
           'Cholesterol(mg)' = sum(`Cholesterol(mg)`)/sum(`Portion(fl oz)`),
           'Sodium(mg)' = sum(`Sodium(mg)`)/sum(`Portion(fl oz)`),
           'Total Carbohydrate(g)' = sum(`Total Carbohydrate(g)`)/sum(`Portion(fl oz)`),
           'Dietary Fiber(g)' = sum(`Dietary Fiber(g)`)/sum(`Portion(fl oz)`),
           'Sugars(g)' = sum(`Sugars(g)`)/sum(`Portion(fl oz)`),
           'Protein(g)' = sum(`Protein(g)`)/sum(`Portion(fl oz)`),
           'Caffeine(mg)' = sum(`Caffeine(mg)`)/sum(`Portion(fl oz)`)) %>%
  ungroup()
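The same per-ounce calculation can be written more compactly. The following is a sketch, assuming across() from dplyr and a made-up tibble with only two nutrient columns (toy is hypothetical). It applies one formula to every column except the portion column, instead of repeating it once per nutrient.

```r
library(dplyr)
library(tibble)

# Hypothetical data: drink A in two sizes, drink B in one size
toy <- tibble(
  Drink = c("A", "A", "B"),
  `Portion(fl oz)` = c(8, 16, 12),
  Calories = c(80, 160, 240),
  `Sugars(g)` = c(8, 16, 36)
)

# For each drink, divide the total of every nutrient by the total portion size
toy_per_oz <- toy %>%
  group_by(Drink) %>%
  summarise(across(-`Portion(fl oz)`, ~ sum(.x) / sum(`Portion(fl oz)`)))
```

Drink A totals 240 calories over 24 fl oz, so its per-ounce value is 10 regardless of how the sizes are mixed.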

6. Data visualization

6.1 Corrplot

First, we compute the correlation matrix of the twelve nutrition columns (columns 2 to 13) using cor() from the stats package.

sb_kids_grouped.cor <- cor(sb_kids_grouped[,2:13])
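As a small illustration of what cor() returns, the sketch below uses a made-up data frame (toy is hypothetical) and selects the numeric columns by type rather than by position. cor() computes Pearson correlations by default and returns a symmetric matrix with ones on the diagonal.

```r
# Hypothetical data with one label column and three numeric columns
toy <- data.frame(
  drink    = c("A", "B", "C", "D"),
  calories = c(100, 150, 200, 250),
  sugars   = c(10, 15, 20, 25),  # rises linearly with calories
  protein  = c(8, 6, 4, 2)       # falls linearly as calories rise
)

# Keep only numeric columns, then compute the Pearson correlation matrix
toy_num <- toy[sapply(toy, is.numeric)]
toy_cor <- cor(toy_num)
toy_cor
```

Selecting by type is a little more robust than `[, 2:13]` if columns are later added or reordered.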

Next, corrplot() is used to plot the corrgram. Arguments controlling the glyph shape, colours and layout are included to finalize the visualization:

corrplot(sb_kids_grouped.cor,
         type = "lower",
         addCoef.col = "black",
         method = "ellipse",
         diag = FALSE,
         tl.col = "black",
         tl.cex = 1,
         number.cex = 0.8,
         col = COL2('RdYlBu'))

6.2 Heatmap

First, we need to label the rows by drink name instead of row number, using the code chunk below.

row.names(sb_kids_grouped) <- sb_kids_grouped$Drink

The data was loaded into a data frame, but it has to be a data matrix to make the heatmap.

The code chunk below will be used to transform the data frame into a data matrix.

sb_kids_matrix<- data.matrix(sb_kids_grouped)
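An alternative route to a labelled numeric matrix is sketched below, assuming column_to_rownames() from tibble and a made-up two-row tibble (toy is hypothetical). It moves the label column into the row names first, so the resulting matrix contains only numeric columns.

```r
library(tibble)

# Hypothetical per-ounce summary with a label column
toy <- tibble(
  Drink = c("A", "B"),
  Calories = c(10, 20),
  `Sugars(g)` = c(1, 3)
)

# Move Drink into the row names, then convert the remaining columns to a matrix
toy_matrix <- as.matrix(column_to_rownames(toy, var = "Drink"))
toy_matrix
```

With this approach there is no leftover label column, so a later `[, -c(1)]` step would not be needed.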

To determine the best clustering method and number of clusters, the dend_expend() and find_k() functions of the dendextend package will be used.

First, the dend_expend() will be used to determine the recommended clustering method to be used.

sb_kids_matrix_d <- dist(normalize(sb_kids_matrix[, -c(1)]), 
                         method = "euclidean")
dend_expend(sb_kids_matrix_d)[[3]]
  dist_methods hclust_methods     optim
1      unknown         ward.D 0.5614832
2      unknown        ward.D2 0.6088735
3      unknown         single 0.6646756
4      unknown       complete 0.6243221
5      unknown        average 0.7387914
6      unknown       mcquitty 0.6958625
7      unknown         median 0.5369151
8      unknown       centroid 0.6061457

The output table shows that the “average” method should be used because it gives the highest optimum value (0.7387914).
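The distance matrix above is computed on normalized data. As I understand it, normalize() from heatmaply rescales each column to the 0 to 1 range. The base R sketch below shows that idea on a made-up matrix (min_max and m are hypothetical names) and why it matters for Euclidean distances: without rescaling, the column with the largest raw values would dominate.

```r
# Min-max rescaling of one numeric vector to the range [0, 1]
# (assumes the vector is not constant, otherwise this divides by zero)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Hypothetical matrix with columns on very different scales
m <- cbind(calories = c(10, 20, 30), caffeine = c(0, 0.5, 2))

# Rescale column by column, then compute Euclidean distances between rows
m_scaled <- apply(m, 2, min_max)
d <- dist(m_scaled, method = "euclidean")
d
```

After rescaling, rows 1 and 3 sit at opposite corners of the unit square, so their distance is the square root of 2.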

Next, find_k() is used to determine the optimal number of clusters.

sb_kids_matrix_clust <- hclust(sb_kids_matrix_d,
                               method = "average")
num_k <- find_k(sb_kids_matrix_clust)
plot(num_k)

The resulting plot suggests that k = 10 is a good choice.
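To illustrate what choosing a number of clusters means for a dendrogram, here is a base R sketch on six made-up points (pts is hypothetical) that form two obvious groups. hclust() builds the tree with average linkage and cutree() cuts it into k groups, which is the kind of grouping the k_row argument of heatmaply() colours in the row dendrogram.

```r
# Hypothetical 2-D points: three near the origin, three near (10, 10)
pts <- rbind(
  c(0, 0), c(0, 1), c(1, 0),
  c(10, 10), c(10, 11), c(11, 10)
)

# Average-linkage hierarchical clustering on Euclidean distances
hc <- hclust(dist(pts), method = "average")

# Cut the tree into two groups and inspect the membership of each point
groups <- cutree(hc, k = 2)
groups
```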

With reference to the statistical analysis results, we can prepare the code chunk as shown below.

heatmaply(normalize(sb_kids_matrix[,-c(1)]),
          dist_method = "euclidean",
          hclust_method = "average",
          seriate = "none",
          k_row = 10,
          colors = Purples,
          margins = c(NA,70,60,NA),
          fontsize_row = 7,
          fontsize_col = 8,
          xlab = "Nutrition Indicators",
          ylab = "Drink type by milk and whipped cream",
          main = "Starbucks (kids and other drinks) nutrition indicators by milk and whipped cream types \nData transformation using the normalize method",
          Colv = NA
          )

7. Conclusion

From the correlation plot, we see that most of the nutrition indicators are positively correlated with each other. Some pairs are strongly positively correlated: “Total Fat” with “Calories from fat”; “Saturated fat” with “Calories from fat” and “Total Fat”; and “Sugars” with “Calories”, “Sodium” and “Total Carbohydrate”. Interestingly, “Caffeine” is also highly positively correlated with “Dietary Fiber”.

From the heatmap, we can see that “Salted Caramel Hot Chocolate” with whipped cream and milk is the least advisable choice, as it is high in calories, total fat, cholesterol and sugars; frequent consumption may contribute to diabetes, excess weight and high cholesterol. The choice of milk also affects the nutrition indicators.

Moreover, drinks with whipped cream and milk add-ons (except nonfat milk) tend to have higher calories from fat. Interestingly, drinks with coconut milk tend to have noticeably higher levels of saturated fat than drinks with other types of milk.

Drinks with coconut or almond milk have lower protein levels than drinks with other types of milk, including soy. The higher-protein options can be recommended for kids, as protein supports their growth.

Hot chocolate drinks and Vanilla Creme drinks have significant caffeine levels compared to other drinks, regardless of whipped cream and milk type. According to the United States Department of Agriculture, the darker the chocolate, the more caffeine it contains per ounce. Kids should therefore limit their consumption of these drinks.

Overall, within the kids drinks and other segment, drinks with whipped cream or coconut milk are less advisable, as they contain high levels of fat and cholesterol. Steamed apple juice is a better choice for kids, as it is more natural and healthy.